Handling of Out-of-vocabulary Words in Japanese-English Machine Translation by Exploiting Parallel Corpus
Abstract
Japanese contains a large number of loanwords and orthographic variants, which pose a challenge for machine translation. In this article, we present a hybrid model for handling out-of-vocabulary (OOV) words in Japanese-to-English statistical machine translation output by exploiting a parallel corpus. Because the Japanese writing system uses four script sets (kanji, hiragana, katakana, and romaji), we treat these scripts differently. A machine transliteration model transliterates out-of-vocabulary katakana words into English words, and a Japanese dependency structure analyzer is employed to tackle out-of-vocabulary kanji and hiragana words. The evaluation results demonstrate that this approach is effective at addressing out-of-vocabulary words and decreasing the OOV rate in Japanese-to-English machine translation tasks.
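The script-dependent treatment described in the abstract can be illustrated with a minimal sketch. It assumes that an OOV token is routed purely by the Unicode blocks of its characters; the function names `char_script` and `route_oov` and the route labels are hypothetical and are not taken from the article, which may classify tokens differently.

```python
def char_script(ch: str) -> str:
    """Classify one character by Unicode block (illustrative ranges only)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    # Katakana block (includes the prolonged sound mark) plus half-width forms.
    if 0x30A0 <= cp <= 0x30FF or 0xFF66 <= cp <= 0xFF9F:
        return "katakana"
    # CJK Unified Ideographs, Extension A, and the iteration mark.
    if 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or ch == "\u3005":
        return "kanji"
    # ASCII letters and full-width Latin letters.
    if (ch.isascii() and ch.isalpha()) or 0xFF21 <= cp <= 0xFF3A or 0xFF41 <= cp <= 0xFF5A:
        return "romaji"
    return "other"


def route_oov(token: str) -> str:
    """Pick a handling route for an OOV token (hypothetical route labels)."""
    scripts = {char_script(ch) for ch in token} - {"other"}
    if scripts == {"katakana"}:
        # Katakana-only tokens are candidates for transliteration.
        return "transliterate"
    if scripts and scripts <= {"kanji", "hiragana"}:
        # Kanji/hiragana tokens go to the dependency-analysis path.
        return "dependency_analysis"
    return "leave_as_is"


if __name__ == "__main__":
    for word in ["コンピューター", "食べる", "Tokyo"]:
        print(word, route_oov(word))
```

Tokens that mix katakana with other scripts fall through to `leave_as_is` in this sketch; how such mixed tokens should be handled is a design choice the abstract does not specify.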
Similar resources
Exploiting Parallel Corpus for Handling Out-of-Vocabulary Words
This paper presents a hybrid model for handling out-of-vocabulary words in Japanese-to-English statistical machine translation output by exploiting parallel corpus. As the Japanese writing system makes use of four different script sets (kanji, hiragana, katakana, and romaji), we treat these scripts differently. A machine transliteration model is built to transliterate out-of-vocabulary Japanese k...
Applying Text Categorization to Vocabulary Enhancement for Japanese-English Cross-Language Retrieval
In this paper we explore a new method for vocabulary enhancement in cross-language retrieval. The focus is on whether we can improve upon dictionary-based retrieval, machine translation of queries, or the use of a bilingual lexicon derived from parallel corpus alignment. All experiments are done with the NACSIS collection of Japanese scientific abstracts with titles and author-assigned keywords...
CMU Haitian Creole-English Translation System for WMT 2011
This paper describes the statistical machine translation system submitted to the WMT11 Featured Translation Task, which involves translating Haitian Creole SMS messages into English. In our experiments we try to address the issue of noise in the training data, as well as the lack of parallel training data. Spelling normalization is applied to reduce out-of-vocabulary words in the corpus. Using ...
Assamese-English Bilingual Machine Translation
Machine translation is the process of translating text from one language to another. In this paper, Statistical Machine Translation is done on the Assamese and English languages using their respective parallel corpus. Moses, a statistical phrase-based translation toolkit, is used here. To develop the language model and to align the words, we used two other tools, IRSTLM and GIZA, respectively. BLEU sc...
Improving Japanese-to-English Neural Machine Translation by Paraphrasing the Target Language
Neural machine translation (NMT) produces sentences that are more fluent than those produced by statistical machine translation (SMT). However, NMT has a very high computational cost because of the high dimensionality of the output layer. Generally, NMT restricts the size of the vocabulary, which results in infrequent words being treated as out-of-vocabulary (OOV) and degrades the performance o...
Journal title:
- Int. J. of Asian Lang. Proc.
Volume 23
Publication date: 2015